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Abstract 

Background: Sequence signatures, as defined by the frequencies of /c-tuples (or /c-mers, /c-grams), have been used 
extensively to compare genomic sequences of individual organisms, to identify c/s-regulatory modules, and to study 
the evolution of regulatory sequences. Recently many next-generation sequencing (NGS) read data sets of 
metagenomic samples from a variety of different environments have been generated. The assembly of these reads 
can be difficult and analysis methods based on mapping reads to genes or pathways are also restricted by the 
availability and completeness of existing databases. Sequence-signature-based methods, however, do not need the 
complete genomes or existing databases and thus, can potentially be very useful for the comparison of 
metagenomic samples using NGS read data. Still, the applications of sequence signature methods for the 
comparison of metagenomic samples have not been well studied. 

Results: We studied several dissimilarity measures, including d 2l d\ and d\ recently developed from our group, a 
measure (hereinafter noted as Had) used in CVTree developed from Hao's group (Qi et al., 2004), measures based 
on relative di-, tri-, and tetra-nucleotide frequencies as in Willner et al. (2009), as well as standard l p measures 
between the frequency vectors, for the comparison of metagenomic samples using sequence signatures. We 
compared their performance using a series of extensive simulations and three real next-generation sequencing 
(NGS) metagenomic datasets: 39 fecal samples from 33 mammalian host species, 56 marine samples across the 
world, and 13 fecal samples from human individuals. Results showed that the dissimilarity measure d\ can achieve 
superior performance when comparing metagenomic samples by clustering them into different groups as well as 
recovering environmental gradients affecting microbial samples. New insights into the environmental factors 
affecting microbial compositions in metagenomic samples are obtained through the analyses. Our results show that 
sequence signatures of the mammalian gut are closely associated with diet and gut physiology of the mammals, 
and that sequence signatures of marine communities are closely related to location and temperature. 

Conclusions: Sequence signatures can successfully reveal major group and gradient relationships among 
metagenomic samples from NGS reads without alignment to reference databases. The d\ dissimilarity measure is a 
good choice in all application scenarios. The optimal choice of tuple size depends on sequencing depth, but it is 
quite robust within a range of choices for moderate sequencing depths. 
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Background 

The study of dissimilarity between samples, known as 
beta-diversity (^-diversity), is crucial for studying micro- 
bial communities from environmental or human niches 
[1]. Beta-diversity gives a quantitative measure of differ- 
ences between two microbial samples, forming the basis 
for a quantitative comparison of multiple samples. The 
dissimilarity matrix defined by beta-diversity measures be- 
tween all pairs of samples in a set of multiple samples can 
be utilized to study group relationships of the samples [2] 
and to understand environmental gradients affecting mi- 
crobial samples [3]. 

Valuable insights into the diversity of hundreds of un- 
cultured microbial samples from various environments 
or niches have been provided by sequencing the small 
subunits of ribosomal RNA, specifically the 16S rRNA, 
with the conventional Sanger sequencing or the next- 
generation sequencing (NGS) technologies from a variety 
of environments, including soil [4,5], ocean [6-8], mam- 
malian gut [9], human skin [10-12], human gut [13-15], 
and human oral cavity [16-18], among many others. In 
16S rRNA-based surveys, several analytical procedures 
have been carried out to compare multiple microbial sam- 
ples using different beta-diversity measures [19]. The two 
general categories of beta-diversity measures include 
phylogenetic- and taxon-based methods. Using UniFrac 
[20-23] as an example, phylogenetic-based methods first 
generate a phylogenetic tree of sequences in each sample 
and then compare samples by counting overlaps of 
branches of their corresponding trees. Taxon-based meth- 
ods, on the other hand, calculate beta-diversity through 
binning sequences to Operational Taxonomic Units 
(OTUs), or assigning sequences to, for example, species or 
genera, and then comparing samples by counting overlaps 
in the taxa [24-26]. 

With the rapid development of NGS technology, 
whole metagenome shotgun sequencing (WMGS) is be- 
coming a new powerful approach to investigate complex 
microbial samples [27-31]. Metagenomics data provide 
more complete information on the microbial community, 
but beta-diversity measures for metagenomic sequencing 
reads from different microbial communities are signifi- 
cantly under- studied. A common practice of analyzing 
metagenomics data is to first map the short reads to 
known genes or pathways in existing databases, such as 
NR, KEGG or IMG and then compare their abundances 
between samples based on the mapped functional categor- 
ies (e.g., [31]). However, as a result of the incompleteness 
of microbial genomic annotation and function databases, 
only a small fraction of reads can be mapped to known 
genes and pathways, resulting in significant loss of infor- 
mation in the comparison of metagenomes. 

Genome sequence signatures refer to the frequencies of 
/c-tuples (/c-mers, /c-grams) in a genome. Previous studies 



have shown that /c-tuple frequencies are similar across dif- 
ferent regions of the same genome, but differ between gen- 
omes [32]. Sequence signatures have been widely used 
to study the evolutionary relationships among genomic 
sequences [33,34], to study horizontal gene transfer among 
different genomes [35] and to bin genome fragments from 
metagenomic samples [36,37]. In metagenomic studies, the 
number of organisms in the communities, the complete 
genome sequences of the organisms, and their abundance 
levels are usually not known. Thus, samples cannot be dir- 
ectly compared on the basis of abundance levels of organ- 
isms within communities. Methods based on mapping 
reads to known genes or functional categories are also 
restricted by the availability and completeness of existing 
databases. However, since genomes have their sequence 
signatures, differences in microbial compositions of sam- 
ples will result in differences in sequence signatures of the 
metagenomes. 

Thus far, sequence signatures have not been widely 
applied in the quantitative comparison of metagenomic 
samples, except a few studies that used di-, tri-, and tetra- 
nucleotide signatures (e.g., [38,39]). On the other hand, 
with the availability of NGS data, sequence- signature- 
based methods have high potential for the comparison of 
metagenomes. NGS technologies have provided efficient 
approaches to sample short reads of DNA sequences from 
the metagenomic communities. Maillet et al. [40] recently 
presented an algorithm to efficiently find similar reads be- 
tween two metagenomic datasets based on /c-tuple signa- 
tures. Without using complete genomes or information of 
any known genes, sequence signatures of the metagen- 
omes can be reflected from the NGS short reads and can 
be easily calculated. Thus, we can use sequence signatures 
to compare metagenomic samples from NGS short reads 
without doing sequence assembly or alignment. 

Key steps in sequence signature methods include the 
counting of frequencies of all /c-tuples to compose the se- 
quence signature vector for each sample, calculating the 
dissimilarity measure between samples based on their se- 
quence signature vectors, and analyzing relationships be- 
tween multiple samples based on dissimilarities of all 
sample pairs. In this study, we systematically investigated 
these key steps using a series of simulated and real metage- 
nomic datasets, giving special attention to the choice of dis- 
similarity measures, the length of the sequence signatures, 
and the model for the background genome sequences for 
the comparison of multiple metagenomic samples. 

Results 

We conducted a series of computational experiments by 
both extensive simulations and real data analyses to 
study the effectiveness of the sequence signature meth- 
ods in identifying group and gradient relationships of 
microbial community samples. We first simulated four 
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types of datasets to investigate factors that may affect 
the performance of the sequence signature methods. 
The simulated datasets were generated by sampling from 
mixtures of multiple true genomes with different abun- 
dance levels. We studied three d 2 -type dissimilarity 
measures, as defined in [34], based on /c-tuple count vec- 
tors and three dissimilarity measures defined on the 
basis of comparing the actual /c-tuple frequency vectors 
to evaluate the beta-diversity between different samples. 
We also studied the performance of a dissimilarity meas- 
ure used in CVTree (Had) [41] and measures based on 
di-, tri-, and tetra-nucleotide signatures (Willner) [38]. 
To calculate d 2 and d 2 , the expected count of the k- 
tuples under a suitable probability model for the back- 
ground sequences is needed. Here Markov models of 
different orders for the underlying background genomes 
were used to calculate the expected counts of 
/c-tuples. We studied a total of 14 dissimilarity measures 
(see Materials and Methods for details). Using the dis- 
similarity matrix obtained from any one of the measures, 
we studied the relationships of the samples. 

Simulation 1: detecting group relationships among 
metagenomic samples of relatively low complexity 

In some situations, metagenomic samples may form dif- 
ferent groups. For example, gut samples may group based 
on diet, and marine samples may group based on loca- 
tions. In order to evaluate the ability of the dissimilarity 
measures to detect such group relationships, we simulated 
datasets of 90 metagenomic samples belonging to 3 differ- 
ent groups (30 samples in each group) similar to the simu- 
lation method of Kuczynski et al [19]. Each sample was 
generated by simulating NGS short reads from a mixture 
of the genomes of 5 microbial species detected in soil [42] 
with different abundance levels (see Materials and Meth- 
ods for details). 

We applied the 14 dissimilarity measures based on se- 
quence signatures (defined in Materials and Methods) 
with different /c- tuple sizes k = 2—10 in this dataset to 
detect the group relationships of the 90 samples by clus- 
tering analysis. For a specific tuple size /<, /c- tuple count/ 
frequency vector of each sample was first calculated, and 
samples were compared with each other using each dis- 
similarity measure. We then used the UPGMA method 
[43] to cluster the samples based on the dissimilarity 
matrix. The parsimony test [25] was finally used to com- 
pare the derived clusters with the actual groups given in 
the simulation model. 

The number of sequences in each sample, termed se- 
quencing depth, may affect the accuracy of the sequence 
signature methods. We simulated three different sequen- 
cing depths: 1,000, 10,000, and 100,000 reads per sam- 
ple. To study the stochastic variation of the results, the 
simulation was repeated 100 times at each sequencing 



depth. As an example, Figure 1 shows one clustering re- 
sult of the 90 samples at sequencing depth of 10,000 
obtained with k = 5 and the dissimilarity measure 
which is the d 2 dissimilarity measure (Eq.2) under the 
Markov model of order 0 (M 0 ), the independent identi- 
cally distributed (i.i.d.) model. It can be seen that most 
of the samples in the original 3 groups can be clustered 
together. Compared with random clustering results, the 
parsimony test score is highly statistically significant 
with Monte Carlo p-vdlue <<0.001. 

We studied various factors, including the order of the 
Markov model for the background sequence, the tuple 
size /c, and sequencing depth, affecting the performance 
of a% and d 2 in recovering the group relationships of the 
samples. Figure 2 shows the relative performance of d 2 
and d 2 coupled with the i.i.d. model for the background 
sequence with other dissimilarity measures, including d 2 , 
Ma, Eu, Ch, Hao, and Willner, Overall, the dissimilarity 
measures d 2 , Ma, Eu, Ch, Hao and Willner do not per- 
form as well as d 2 and d 2 . The poor performance of d 2 
can be explained by the fact that it is dominated by the 
variation of the tuple occurrences within one sample, 
and is less affected by the relationship between the 
sequences in both samples. Meanwhile, the poor per- 
formance of Hao could be attributed to the high number 
of parameters that need to be estimated to fit a Markov 
model of order /c-2. Ch considers the maximum differ- 
ence between the tuple frequencies for the samples only 
and does not make full use of the information from all 
the tuples. Most importantly, the normalization of the 
tuple counts plays an important role in the superior per- 
formance of d 2 and d 2 . For each dissimilarity measure, 
the optimal tuple size giving the lowest parsimony score 
depends on the sequencing depth. For example, when 
the sequencing depth is 1,000, the optimal tuple size for 
d 2 , d 2 , Eu, and Ma is around 5 to 7. When sequencing 
depth is high (100,000 reads or higher), the parsimony 
score for a%, d 2 , Ma, and Eu decreases as the tuple size 
increases from 2 to 10. 

Next we studied the effect of the order of the Markov 
model for the background sequences on the parsimony 
score of d 2 and d 2 , and the results are given in Additional 
file 1: Figure SI. When the sequencing depth is low 
(< 10,000 reads), the i.i.d. model gives the lowest parsi- 
mony score in general. When the sequencing depth is high 
(> 100,000 reads), the effect of the order of the Markov 
model is negligible and the results are similar across the 
different orders (0, 1, 2, and 3), especially when the tuple 
size is relatively large (k > 6). One potential explanation 
for the above observations is as follows. When the sequen- 
cing depth is low, the absence of sufficient data to accu- 
rately estimate the many parameters in high-order 
Markov models results in the low accuracy to cluster the 
metagenomic samples. However, when the sequencing 
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Figure 1 Metagenomic samples can be clustered well using dissimilarity measure d\\M Q . Example of a clustering of 90 samples in 
Simulation 1 obtained at sequencing depth = 10,000 reads per sample, with k = 6 and dissimilarity measure d s 2 \M 0 . The true clusters are labeled 
by three colors and symbols (red triangles, green squares and blue discs). The parsimony test score is 10 in this example with p-value less 
than 0.001. 



depth is high, this is not a problem and Markov models of 
different orders give similar results. 

Finally, we studied the standard deviations of the parsi- 
mony score of d\ and the results are given in Additional 
file 1: Figure S2. From the figure we can see that at lower 
sequencing depth (1,000 sequences), the best performance 
is obtained at k = 5. When the sequencing is 10 times 



deeper, the best k becomes 9, and it further increases 
when the sequencing depth further increases. We can also 
see that the standard deviation of the parsimony score 
among the 100 repeated experiments at each sequencing 
depth is quite moderate and stable with different choices 
of /<, and deeper sequencing will decrease the standard de- 
viation of the parsimony scores, as expected. 




Jiang et al. BMC Genomics 2012, 13:730 
http://www.biomedcentral.eom/1 471 -21 64/1 3/730 



Page 5 of 1 7 



Simulation 2: revealing environmental gradients from 
metagenomic samples of relatively low complexity 

The second simulation experiment was designed to 
evaluate the effectiveness of sequence signature methods 
for analyzing gradient variation of microbial communi- 
ties. A set of 20 metagenomic samples was generated by 
simulating NGS reads from 5 soil bacterial species as in 
the above simulations [42] with varying abundance 
levels. We designed the proportion of the 5 genomes to 
vary from sample 1 to sample 20 in a way that mimics 
the situation of gradient variation across the samples, 
and we then studied how well the sequence signature 
methods reveal such gradient variations from the meta- 
genomic data. Dissimilarity matrices were calculated 
using different dissimilarity measures and different /<- 
tuple sizes as above. PCoA [44], an effective approach to 
display beta-diversity among multiple samples, mapped 
the 20 samples to a two-dimensional space. Then the 
Pearson Correlation Coefficient (PCC) was calculated 
between the first principal coordinate (PCI) given by 
PCoA and the predetermined gradient axis built in the 
simulation model. The PCC can be taken as an index of 
how well the sequence signature method reveals the gra- 
dient variation in samples (see Materials and Methods 
for details). A higher PCC indicates better performance 
of the dissimilarity measure in recovering the gradient 
among the microbial samples. 

Similar to Simulation 1, we repeated the simulation 
experiments 100 times at each of the three sequencing 
depths of 1,000, 10,000 and 100,000 reads per sample. 
Figure 3 shows the average PCC of the different dissimi- 
larity measures at different sequencing depths and tuple 
sizes. We can see that the dissimilarity measures per- 
form with greater differences than observed in the case 
of group relationships in Simulation 1. Specifically, the 
performance of d 2 and d 2 is quite robust with respect to 
the tuple size and the sequencing depth. In most situa- 
tions, d 2 and d 2 perform similarly manner, and out- 



perform other dissimilarity measures. The average PCCs 
for these two measures are larger than 0.75 when tuple 
size k is between 4 and 8, and the sequencing depth is 
10,000. Deeper sequencing can slightly increase the PCC 
but the improvement is not as significant as that 
observed in Simulation 1. Additional file 1: Figure S3 
shows that the performance of d 2 and d 2 is highly robust 
with respect to the order of Markov model used to cal- 
culate the expected occurrences of /c-tuples. Additional 
file 1: Figure S4 shows the means and standard devia- 
tions of PCC obtained using the dissimilarity measure d 2 
with tuple sizes k = 2—10 at different sequencing depths. 
We can see that the optimal tuple sizes, at lower or 
moderate sequencing depths, are between 4 and 7, and 
this range also gives the smallest standard deviation al- 
though the standard deviation is, in general, very small. 

Simulations 3 and 4: revealing group relationship and 
environmental gradients from metagenomic samples of 
relatively high complexity 

The above simulation experiments illustrated the per- 
formance of the various dissimilarity measures for mi- 
crobial communities of relatively low complexity. Most 
microbial communities are much more complex with 
hundreds to thousands of genomes. In order to see the 
performance of the dissimilarity measures for discover- 
ing group or gradient relationships among metage- 
nomic samples of high complexity, we simulated two 
NGS metagenomic datasets with more complex micro- 
bial compositions. The simulated communities consist 
of 113 microbial genomes from a collection of gen- 
omes given by the FAMeS dataset [45,46]. We simu- 
lated short reads with both Roche/454 and Illumina/ 
Solexa platforms. 

In Simulation 3, we generated 60 samples belonging to 
3 groups, as defined by different compositions of the mi- 
crobial genomes (see Materials and Methods for details). 
In Simulation 4, we simulated 20 samples related through 




° 2 4 6 8 10 ° 2 4 6 8 10 ° 2 4 6 8 10 



ktuple size ktuple size ktuple size 

Figure 3 The relative performance of various dissimilarity measures at different sequencing depths in recovering gradient 

relationships of metagenomic samples. The dissimilarity measures cT 2 and d\ outperform others in most situations. 
\ ) 
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a gradient (see Materials and Methods for details). As in 
Simulations 1 and 2, we generated datasets at three 
sequencing depths: 1,000, 10,000 and 100,000 reads 
per sample. For each setting, we generated 100 duplicated 
datasets to simulate possible stochastic effects in real 
NGS data. 

The complete simulation results are given in Additional 
file 1: Figures S5-S7. The qualitative results for the relative 
performance of the different dissimilarity measures are 
similar to that from the first two simulations for metage- 
nomic samples of relatively low complexity. We also chan- 
ged the read length for Roche/454 and Illumina/Solexa in 
our simulations and the results reported above continue 
to hold. 

Detecting group relationships among mammalian gut 
samples 

We applied the sequence signature methods to analyze a 
real mammalian gut metagenomic dataset by Muegge 
et al [31]. It includes the NGS short reads of 39 metage- 
nomic samples from 33 mammalian species of herbivores, 
carnivores and omnivores (Additional file 1: Table SI). 
Previous studies showed that the microbial compositions 
of omnivores are very diverse and cannot be distinguished 
from other samples [31]. Consequently, we first excluded 
the 11 omnivore samples and focused on the remaining 
28 herbivore and carnivore samples. These samples 

Table 1 Parsimony scores on the clustering of the three 
mammalian groups (hindgut-fermenting herbivores, 
foregut-fermenting herbivores and simple-gut) obtained 
by sequence signature methods using different 
dissimilarity measures based on the metagenomic data 
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The "Mi" indicates that the expected counts are calculated based on /'-th order 
Markov model for the background sequences. Monte Carlo p-values were 
estimated by comparing observed parsimony score to the scores in 1000 
randomly-joined trees: parsimony score = 2-7, p < 0.001; score = 8, p = 0.002. 
Boldfaced are the two lowest parsimony scores. 



include 3 groups: hindgut-fermenting herbivores (n = 8), 
foregut-fermenting herbivores (n = 13), and simple-gut 
carnivores (n = 7). We investigated how well the sequence 
signature methods identified these 3 groups from the 
NGS metagenomic data. Similar to the simulation studies, 
we used UPGMA to cluster the samples based on the dis- 
similarity matrix, as defined by different dissimilarity mea- 
sures based on sequence signatures, and we assessed how 
well the resulting cluster tree revealed the underlying 3 
groups by the parsimony test. 

The resulting parsimony scores for the different dissimi- 
larity measures and tuple size are summarized in Table 1. 
For most choices of the tuple size k and dissimilarity mea- 
sures, the sequence signature methods could group the 
samples well. This indicates significant group differences 
in sequence signatures of the metagenomes among 
hindgut-fermenting herbivores, foregut-fermenting herbi- 
vores and simple-gut carnivores. The smallest parsimony 
score (= 2, indicating best grouping of the samples) was 
achieved with a% coupled with the i.i.d. model for the 
background sequences (M 0 )and k = 5. When k varies be- 
tween 3 and 9, the parsimony scores obtained with |M 0 
are never greater than 4 and are always the smallest 
among those with other dissimilarity measures. This 
shows that |M 0 has the best performance among all the 
dissimilarity measures we considered and that the optimal 
performance is not very sensitive to the choice of tuple 
size k as long as it is within a reasonable range. When the 
tuple size k was increased to 10 or above, we also observed 
that the performances of all dissimilarity measures became 
worse. To explain, with a given data size and large /c, the 
number of occurrences of each /c-tuple will be too small, 
causing the result to be unstable. We also experimented 
with k = 2, which also resulted in poor performances. 
However, when k = 2, the sequence signature vector only 
has 16 dimensions and therefore cannot adequately reflect 
the diversity of the samples. 

Figure 4a shows the clustering tree obtained with tuple 
size k = 5 and |Af 0 . Clear separations among the three 
groups of samples can be observed. The relationships of 
the samples can also be visualized in 2-dimension by the 
PCoA plot in Additional file 1: Figure S8(a). From the 
PCoA plot, we can see that 3 of the 4 pairs of samples 
with the same host species are also clustered together: 2 
Okapis (Okapisl and Okapis2), 2 Bighorn Sheeps 
(BigHornl and BigHorn2) and 2 Rock Hyraxes (HyraxSD 
and HyraxSTL), while the 2 Lions (Lionl and Lion2) are 
separated. Moreover, the digestive physiology of the host 
samples (foregut-fermenting vs. hindgut-fermenting) is the 
most distinguishing information extracted by the sequence 
signature method. This observation was not reported from 
this dataset in the original study [31]. 

Then we added back the 11 omnivorous samples to the 
dataset and reanalyzed the data, and the result is shown in 
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(a) 




Urial2 (Transcaspian Urial Sheep; STL) 
Gazelle3 (Gazelle; STL) 
Okapil(Okapi;STL) 
0kapi2(0kapi;STL) 
Kroo3 (Kangaroo; SD) 
SpgbkW (Springbok, Wild) 
HyraxSTL (Rock Hyrax; STL) 
HyraxSD (Rock Hyrax; SD) 
Colobus(Colobus; STL) 
Giraffe2 (Giraffe; STL) 
VWPig (Visayam Warty Pig; SD) 
BigHornSD (Bighorn Sheep; SD) 
BigHornW3 (Bighorn Sheep; Wild) 
Horsel (Horse; Wild) 
AfElphSD3 (African Elephant; SD) 
ZebraSTLl (Zebra; STL) 
Orangl(Orangutan;STL) 
GorillaSTL (Gorilla; STL) 
BlackRhinol (Balck Rhinoceros; SD) 
Capybara (Capybara; STL) 
Rabbit (European Rabbit; STL) 
Armadillo (Armadillo; STL) 
Hyena2 (Hyena; STL) 
BushDogl(Bush Dog; STL) 
Lionl (Lion; STL) 
Lion2 (Lion; STL) 
PolarBr2 (Polar Bear; STL) 
Echidna (Echidna; STL) 



(b) 




Giraffe2 (Giraffe; STL) 
Gazelle3 (Gazelle; STL) 
Urial2 (Transcaspian Urial; STL) 
Okapil(Okapi;STL) 
0kapi2(0kapi;STL) 
Kroo3 (Kangaroo ; STL) 
SpgbkW (Springbok; Wild) 
HyraxSTL (Rock Hyrax; STL) 
HyraxSD (Rock Hyrax; SD) 
Colobus(Colobus; STL) 
BaboonW (Baboon; Wild) 
Chimpl (Chimpanzee; STL) 
Squirrel (Squirrel; STL) 
Armadillo (Armadillo; STL) 
BushDogl (Bush Dog; STL) 
Hyena2 (Hyena; STL) 
Lion (Lionl; STL) 
Rabbit (European Rabbit; STL) 
AfElphSD3 (African Elephant; SD) 
Horsel (Horse; Wild) 
ZebraSTLl (Zebra; STL) 
Capybara (Capybara; STL) 
BlackRhinol (Black Rhinoceros; STL) 
GorillaSTL (Gorilla; STL) 
Orangl (Orangutan; STL) 
BaboonSTL (Baboon; STL) 
Callimicos (Callimicos; STL) 
BlackLemur (Black Lemur; STL) 
Saki (Saki;STL) 

RTLemur (Ring-tailed Lemur; STL) 
VWPig (Visayam Warty Pig; SD) 
BigHornSD (Bighorn Sheep; SD) 
BigHornW3 (Bighorn Sheep; Wild) 
Chimp2 (Chimpanzee; STL) 
Lion2 (Lion; STL) 
PolarBr2 (Polar Bear; STL) 
BlackBr2 (Black Bear; STL) 
SpecBr2 (Spectacled Bear; STL) 
Echidna (Echidna; STL) 



Figure 4 Clustering results of the mammalian gut samples based on 5-tuples and dissimilarity measure c/f |M 0 : (a) without the 
omnivore samples; (b) with the omnivore samples. Green upward triangles: foregut-fermenting herbivore; yellow downward triangles: 
hindgut-fermenting herbivore; red discs: simple-gut carnivore; blue squares: simple-gut omnivore. 



Figure 4b. We can see that the omnivore samples are not 
grouped as a single cluster, but are scattered throughout 
the clustering tree. This suggests more diversity in the 
fecal microbiome of omnivores than that of herbivores or 
carnivores, consistent with previous observations based 
on 16S rRNA data [31]. Some omnivore samples are clus- 
tered together with samples of other diets. For example, 
samples from omnivorous bears (Spectacled Bear, SpecBr2 
and Black Bear, BlackBr2) share considerable similarity 
with those from carnivores, as previouslyreported in [9]. 
The PCoA plot including the omnivore samples is given 
in Additional file 1: Figure S8(b). 



Detecting group and gradient variations in global ocean 
metagenomics data 

We applied the sequence signature methods to analyze 
the metagenomics data of global ocean samples col- 
lected from different geographic locations and condi- 
tions by Rusch et al. [29]. The geographic locations of 
the samples include the Sargasso Sea, Caribbean Sea, 
Eastern Tropical Pacific and Tropical South Pacific, and 
the samples also contain variation in microbial habitat 
types (e.g., open and coastal) and environmental tem- 
peratures (north temperate, south tropical, and north 
tropical) (see Additional file 1: Table S2). All these 
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Figure 5 Cluster analysis of open ocean samples from different geographical locations, (a) Geographical locations of the 56 global ocean 
samples. The 23 open ocean samples are indicated as follows: Sargasso Sea (n=7, blue), Caribbean Sea (n=2, cyan), Eastern Tropical Pacific (n=4, 
green), and Tropical South Pacific (n=10, red), (b) Clustering results of open ocean samples based on 5-tuples and dissimilarity measure d s 2 \M 0 . 
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Table 2 Parsimony scores on the clustering tree of 23 
open ocean samples belonging to 4 groups (Sargasso 
Sea, Caribbean Sea, Eastern Tropical Pacific and Tropical 
South Pacific) obtained by sequence signature methods 
using different dissimilarity measures based on the open 
ocean metagenomics data 
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The "M|" indicates that the expected counts are calculated based on /'-th order 
Markov model for the background sequences. Monte Carlo p-values were 
estimated by comparing observed parsimony score to the scores in 10000 
random joined trees: parsimony score = 4-6, p < 0.001; score = 7, p = 0.004; 
score = 8; p = 0.003; score = 9; p = 0.12. Boldfaced are the two lowest 
parsimony scores. 

factors may influence the composition of seawater 
microbiomes. To avoid the interacting effect of loca- 
tions and habitat types, we applied the sequence signa- 
ture methods on the 23 open ocean samples and the 19 
coastal water samples separately. 

The 23 open ocean samples form four geographic 
groups: Sargasso Sea (n = 7), Caribbean Sea (n = 2), 
Eastern Tropical Pacific (n = 4), and Tropical South 
Pacific (n = 10) (Figure 5a). We conducted clustering 
analysis with sequence signatures on these samples and 
used the parsimony test to study how well the grouping 
information was revealed (Table 2). Again, for most 
tuple size values, it can be seen that df|M 0 achieves 
the lowest parsimony score among all the dissimilarity 
measures we studied, and also the clustering results are 
quite stable when the tuple size k is between 5 and 9. 
Figure 5b shows the clustering tree of the open ocean 
samples with k = 5 and df|M 0 . We observed that the 
three major groups identified by the sequence signature 
method reflect three major environmental temperature 
conditions: north temperate, north tropical, and south 
tropical. The samples from the Caribbean Sea were clus- 
tered together with those from the Eastern Tropical Pacific 
Ocean. They both lie between the Tropic of Cancer and 
the Equator. 



We also found significant separation among 19 coastal 
water samples (see Additional file 1: Table S3), mainly 
consisting of 9 samples from the North American East 
Coast and 6 samples from the Galapagos Islands in the 
Eastern Tropical Pacific (Additional file 1: Figure S9(a)). 
In the clustering tree obtained with 5-tuples and d\ \M 0 , 
coastal water samples from the North American East 
Coast are separated from those from the Galapagos 
Islands (Additional file 1: Figure S9(b)). However, we 
also see some possible errors in the clustering. For ex- 
ample, sample 20 from the south of Charleston, SC, was 
clustered with samples from the Galapagos Islands, and 
sample 39 from the Galapagos Islands was clustered to- 
gether with samples from the North American East 
Coast. These outliers deserve further investigation. 

We then pooled all 56 global ocean samples and car- 
ried out PCoA on all samples. Figure 6b gives the PCoA 
plot with k = 5 and <if|M 0 . It can be clearly seen that 
geographic location as the gradient primarily drives the 
samples in the PCoA plot. Samples along the back- 
diagonal of the PCoA plot are geographically located 
along the S -shaped line sequentially running through 
the Sargasso Sea (blue), North American East Costal 
(black), Caribbean Sea (cyan), Eastern Tropical Pacific 
(green), Galapagos Islands (pink), Tropical South Pacific 
(red) and Polynesian Archipelagos (yellow). It is worth 
mentioning that sample 42 is both located and PCoA 
plotted between the Galapagos Islands (pink) and the 
Tropical South Pacific (red), although it belongs to the 
Eastern Tropical Pacific group (green). However, the 
habitats (open ocean: filled circle; coastal: filled square; 
other: open circle) do not significantly affect the cluster- 
ing of the microbial communities. 

Experiments on the human gut metagenome data 

We applied cluster analysis based on sequence signa- 
tures of the 13 human gut metagenome samples pub- 
lished by Kurokawa et al [28]. One set of samples 
consists of4 unweaned infants, and the other consists 
of 7 adults and 2 weaned children [28]. The sample 
information is provided in Additional file 1: Table S4. 
Although the sample size is very small in this study, 
we asked whether sequence signature methods could 
still reflect meaningful relationships. Additional file 1: 
Table S5 gives the parsimony scores on the separation of 
these two groups by the clustering based on sequence 
signatures. Again, we see that <if \M 0 has the best per- 
formance for almost all values of tuple size k. The clus- 
tering tree and the ordination plots for the samples are 
shown in Additional file 1: Figure S10. It can be seen 
that the samples from the unweaned infants tend to 
cluster together but show more diversity than the adults 
and weaned children. 
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Figure 6 (See legend on next page.) 
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(See figure on previous page.) 

Figure 6 PCoA ordinates of all 56 global ocean samples are primarily driven by geographical location, (a) Geographic locations of all 
samples, (b) PCoA plot with k = 5 and dissimilarity measure d s 2 \M 0 . For the location, Sargasso Sea samples are colored blue, North America East 
Costal samples are colored black, Caribbean Sea samples are colored cyan, Eastern Tropical Pacific samples are colored green, Galapagos Islands 
samples are colored pink, Tropical South Pacific samples are colored red and Polynesian Archipelagos samples are colored yellow. For the habitat 
type, open ocean samples are marked as discs, coastal samples are marked as squares, and other habitat samples are marked with open circles. 



Discussion 

In this study, we examined the application of sequence 
signature methods for the comparison of microbial 
metagenomic samples with NGS read data. We studied 
both traditional and recently developed dissimilarity 
measures of sequence signature vectors, and we used 
Markov models of different orders to estimate the back- 
ground when a measure requires a background model. 
The dissimilarity measures, with a wide range of choices 
of tuple size /<, were compared on the basis of four sets 
of simulated metagenomics data and three sets of 
real metagenomics data, including metagenomes of gut 
samples of multiple mammalian species, water samples 
of global ocean, and human gut samples. The data repre- 
sent a wide spectrum of topics that can be studied with 
metagenomic approaches. 

In all the data we studied, sequence signature methods 
were able to reveal the major group and gradient vari- 
ation among the samples from the short reads of meta- 
genomic data. The performance of different dissimilarity 
measures varies with different choices of tuple size. The 
recently proposed d\ measure performs the best in all 
scenarios and the optimal tuple size for it is between 5 
and 9, for NGS data with moderate sequencing depth of 
about 10,000 sequences of 200nt. Its performance is not 
highly sensitive to the tuple size, making it easier to apply 
the method on real data. The experiments also show that 
new biological insights on factors affecting the /c-tuple 
compositions of the metagenomes can be obtained. In the 
mammalian gut data experiment, the first factor is the 
diet, followed by the digestive physiology. For the global 
ocean samples, the most important factor is location, fol- 
lowed by habitat. For the human gut samples, weaning is 
a key factor that distinguishes microbiomes of adults and 
children from those of infants. 

In summary, this work shows that the recently devel- 
oped dissimilarity measure d\ provides a powerful ap- 
proach for metagenomic sample comparison based on 
NGS shotgun reads. It is simple and computationally effi- 
cient, and it can reveal underlying relationships between 
microbial community samples. The rapid development of 
NGS technology has provided great opportunities for 
metagenome sequencing. Because availability of known 
microbial genome references are limited and de novo as- 
sembly of metagenome sequences is challenging, the 
alignment-free sequence signature method is a promising 
approach for analyzing multiple metagenomic samples. 



In this study, we concentrated on the comparison of 
metagenomic samples using alignment-free methods 
with sequence signatures. However, with respect to opti- 
mizing strategies for the comparison of metagenomic 
samples, several questions remain. First, we did not 
study how sequence signature methods compare with 
other methods using tag sequences such as 16S rRNA or 
its variable regions for microbiome comparison. Since 
the 16S gene is only a very small fraction of the meta- 
genome, our prior experiences showed that extracting 
16S sequences and classifying the samples using only the 
16S gene sequences do not work well. Only using the 
16S sequences of the metagenomic samples loses infor- 
mation, resulting in inaccurate clustering of the samples. 
Second, we do not know whether metagenome shotgun 
reads or assembled contig sequences should be used for 
metagenomic sample comparison. We did not conduct 
experiments to specifically address this question, but 
based on our experiments on the shotgun reads, se- 
quence signatures can be extracted reasonably well with 
NGS short reads without any assembly. Third, for most 
metagenomics studies, investigators are interested in 
what genes or pathways are active. To achieve this ob- 
jective, reads are usually mapped to the known genes or 
pathways, and then the samples are compared using the 
active genes or pathway. Such approaches can give im- 
portant insights about functions of genes and pathways, 
which sequence signature-based approaches cannot. On 
the other hand, not all genes/pathways are known, and 
typically only a relatively small fraction of the reads 
can be mapped to known genes/pathways. Therefore, 
such gene/pathway-based approach cannot make full 
use of the data for metagenome comparison. Sequence 
signature-based methods use all reads and can poten- 
tially reveal more accurate relationships of samples. Fi- 
nally, all our experiments were based on real datasets 
with all the potential complexities, such as sequence 
errors, heterogeneous sampling of reads from different 
organisms or different genomic regions, as well as biases 
arising from different techniques for sample preparation 
and sequencing. Despite all these potential complexities, 
it is surprising that the clustering of samples using se- 
quence signatures with the d\ measure still performs 
very well. Theoretical and further simulation studies to 
understand the effects of such complexities on the per- 
formance of different dissimilarity measures will be a 
topic for future research. The code for calculating the 
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three d 2 -type measures is provided at http://www-rcf. 
usc.edu/fsun/Programs/d2Meta/d2Metamain.htmL 

Conclusions 

With the rapid development of NGS technologies and 
applications, metagenomics is becoming an important 
approach for studying microbial communities in envir- 
onments and/or the human body. Analyzing shotgun 
metagenomic data by assessing similarity to existing 
genome sequence databases is limited by the imcomple- 
teness of these databases, especially for environments 
that have not been extensively studied. We conducted a 
systematic study on the application of sequence signa- 
ture methods for the comparison of metagenomic sam- 
ples, and found that, if dissimilarity measures are chosen 
properly, alignment-free methods based on sequence signa- 
tures can successfully reveal major group and gradient rela- 
tionships among metagenomic samples from NGS reads 
without using reference databases or sequence assembly. 
These observations showed that analysis of sequence signa- 
tures provides an additional tool to evaluate patterns in mi- 
crobial community data, and this tool is not subject to the 
biases in existing databases. 

Methods 

Simulated metagenomic datasets 

We simulated four NGS metagenomic datasets using 
MetaSim [47]. In the first simulated datasets, the simulated 
communities consist of 5 bacterial species: Acidobacterium 
capsulatum, Bacteroides fragilis, Nitrosospira multiformis, 
Proteus mirabilis and Sulfolobus islandicus found in a soil 
bacterial community from a previous study [42]. Two types 
of sample relationships were simulated: group relationship 
where the species abundance levels of the samples belong 
to different groups and gradient relationship where the 
species abundance levels change continuously with some 
variables such as temperature or location. 

In Simulation 1, we simulated 90 samples belonging to 
3 groups (shown in Additional file 1: Figure Sll). We 
began with a relative abundance vector of the 5 bacterial 
species (0.05,0.10,0.20,0.25,0.40), and each component was 
perturbed by multiplying the absolute value of a normally 
distributed variable with mean 1 and standard deviation 
equal to the value of the component itself. The relative 
abundance vector was renormalized to sum to 1. The ori- 
ginal species relative abundance vector was perturbed and 
renormalized in this manner for three times independ- 
ently, and three relative abundance vectors were obtained: 
(0.022, 0.058, 0.116, 0.507, 0.297), (0.042, 0.088, 0.281, 
0.244, 0.345) and (0.046, 0.066, 0.042, 0.320, 0.526). We 
used these three vectors as the species abundance vectors 
for the three group centers and further generated three 
groups of species relative abundance vectors around them. 
For each group center relative abundance vector, we 



added to each component the absolute value of a Gaussian 
noise of mean zero and standard derivation equal to each 
component and renormalized components to sum to 1. 
Each relative abundance vector was randomized and 
renormalized 30 times, and a total of 90 relative abun- 
dance vectors were obtained. With these relative abun- 
dance vectors, we generated 90 metagenomic samples by 
mixing the 5 bacterial genomes at the abundance levels 
defined in the vectors and sampling short reads from the 
mixture genomes with MetaSim [47]. The sampling pro- 
cedure mimics the next-generation sequencing by the 
Roche/454 platform with read length of 200nt. 

In Simulation 2, we simulated 20 samples consisting 
of the same 5 bacterial species as in Simulation 1 and 
the relative abundance vector of the 5 species shifts 
along a gradient axis. Among the 5 bacterial species, we 
set the abundance level of one of them as constant, and 
let the abundance levels of the other 4 species vary fol- 
lowing 4 functions across the sampling indexes. The four 
functions (shown in Additional file 1: Figure SI 2) have 
centers approximately around sample indexes 0, 5, 10 
and 15, respectively. The abundances of all 4 species 
were first taken from these functions and were then nor- 
malized together with the species with constant abun- 
dance to sum to 1, forming the relative abundance 
vectors. Absolute values of Gaussian noises were added 
to each component of the abundance vector, with mean 
0 and standard deviation equal to the value of each com- 
ponent. The vectors were renormalized after adding the 
noises. We generated 20 metagenomic samples by sam- 
pling the 5 bacterial genomes according to these relative 
abundance vectors using MetaSim [47] with read length 
of 200nt. 

In Simulations 3 and 4, we considered more complex 
communities consisting of 113 microbial genomes from 
the collection of genomes provided by the FAMeS data- 
set [45,46]. The relative abundance vectors of the micro- 
bial genomes were generated from the power-law 
(Zipfs) distribution: 



f{k-a,N) 



l/k a 



with a = 0.3,AT=113, and k = l,...N. 

In Simulation 3, we generated 60 samples belonging 
to 3 groups with 20 samples in each group. For each 
group, we randomly ordered the 113 genomes and 
assigned the z-th genome with relative abundance / (k;a, 
N). Then we added to each component the absolute 
value of a Gaussian noise with mean zero and standard 
derivation equal to each component and renormalized 
components to sum to 1. Each relative abundance vector 
was randomized and renormalized 20 times, and a total 
of 60 relative abundance vectors were obtained. Then 
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we used these relative abundance vectors to simulate 60 
metagenomic samples. 

In Simulation 4, we generated 20 samples consisting of 
the same 113 genomes , and the relative abundance vector 
of the 113 genomes were generated by the power-law 
(Zipf s) distribution as in the third simulation. In order to 
mimic the gradient model, the relative abundance vector 
shifts along a gradient axis of a from 0.275 to 0.75 by 
0.025. Again, absolute values of Gaussian noises were 
added to each component of the 20 abundance vectors, 
with mean 0 and standard deviation equal to the value of 
that component. The vectors were renormalized after add- 
ing the noises. We generated 20 metagenomic samples 
according to these relative abundance vectors using Meta- 
Sim. In order to see the effect of sequencing platform, we 
generated short reads through both Roche/454 and Illu- 
mina/Solexa platforms by their default parameters in 
Simulations 3 and 4. 

In all the simulations, we generated datasets at three 
sequencing depths: 1,000, 10,000 and 100,000 sequen- 
cing reads per sample. At each setting, we generated 100 
duplicated datasets to simulate possible stochastic effects 
in real NGS data. 

Real metagenomic datasets 

We analyzed three real shotgun metagenomic datasets 
published in recent years. 

The Mammalian Gut Dataset includes 39 fecal micro- 
biome samples from 33 mammalian species [31]. Shotgun 
sequencing of whole metagenome by the multiplex shot- 
gun Roche/454 FLX pyrosequencing platform produced a 
total of 2,163,286 reads [mean = 55,469±28,724(SD) per 
sample] (Additional file 1: Table SI). The read length is 
261±83(SD) nt. The animals in the dataset represent three 
diet types and varied digestive physiologies (hindgut- 
fermenting herbivores, n = 8; foregut-fermenting herbi- 
vores, n = 13; simple-gut carnivores, n = 7, and simple-gut 
omnivores, n = 11). As reported in previous studies with 
both targeted 16S rRNA sequencing and whole metagen- 
ome sequencing, the fecal microbiomes of herbivores and 
carnivores are associated with host diets and digestive 
physiologies [9]. Furthermore, bacterial compositions of 
omnivores are very diverse and cannot be distinguished 
from other groups of mammals [31]. Therefore, we stud- 
ied this dataset in two steps. First, we excluded the omni- 
vore samples and only studied the relationships among 
the groups of hindgut-fermenting herbivores, foregut- 
fermenting herbivores and simple-gut carnivores. Then we 
included the omnivore samples to study how the omni- 
vore samples clustered with the other samples. 

The Global Ocean Dataset includes 56 global ocean 
microbiome samples from 41 aquatic, largely marine 
locations across a multi-thousand kilometer transection 
from the North Atlantic through the Panama Canal and 



ending in the South Pacific [29]. Shotgun sequencing of 
whole metagenome by the AB3730xl platform produced 
7,697,926 reads [mean = 137,463±145,307 (SD) per sam- 
ple] (Additional file 1: Table S2). The read length is 
1,032±58(SD) nt. In addition to different geographic 
locations, these samples also represent a range of habitat 
types, such as open ocean (n = 23), coastal (n = 19), and 
estuary (n = 3). As reported in previous studies, the di- 
versity among samples shows significant separation be- 
tween groups located in temperate regions and those 
located in tropical regions, although certain outliers can 
be present [29]. In our study, we first applied the 
sequence signature methods to the 23 open ocean sam- 
ples and 19 coastal ocean samples separately to avoid 
possible interactive effects of sampling location and 
habitat types. Finally, the methods were applied to all 56 
samples to obtain a panoramic view of the microbial 
communities. 

The Human Gut Dataset includes 13 fecal micro- 
biome metagenomic samples from healthy Japanese indi- 
viduals, consisting of 7 adults, 2 weaned children, and 4 
unweaned infants (Additional file 1: Table S4) [28]. Shot- 
gun sequencing of the whole metagenome produced 
1,041,617 reads [mean = 80,124±2,619(SD) per sample] 
with read length of 756±162(SD) nt. These samples were 
collected from 2 families (Family I: 2 adults and 2 weaned 
children; Family II: 2 adults and 1 unweaned infant), 4 in- 
dividual adults, and 2 individual infants. The previous 
study revealed intriguing differences in microbiomes be- 
tween adults and unweaned infants and a high functional 
uniformity in adults and weaned children [28]. In our 
study, sequence signature methods were applied to detect 
the dissimilarity between the groups of samples. 

The /c-tuple count vectors and dissimilarity measures 

The comparison of metagenomic samples using se- 
quence signatures arises from the need for the compari- 
son of two microbial community samples without 
having to do alignment with reference genomes, which 
are often incomplete or unavailable. For every metage- 
nomic sample, we can obtain its /c-tuple count vector as 
the sequence signature. Based on these vectors, we can 
use a dissimilarity measure to evaluate the difference be- 
tween two samples. We analyze the relationships among 
multiple samples using the dissimilarity matrix com- 
posed of the dissimilarities between all pairs of samples. 

The first step is to count the number of occurrences 
of each /c-tuple. Reads in NGS data can come from ei- 
ther the forward strand or the reverse strand of a gen- 
ome. Because we do not know which strand a read 
comes from, we consider both a read and its comple- 
ment to take both strands into consideration. That is, we 
supplement the observed reads by their complements to 
form the read set on which the number of /c-tuple 
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occurrences are counted. For genome or metagenome 
data, we have a finite alphabet set S = {"A", " C ", " G " 
T "} and the set of all possible k-tuples W= {w e S*}. The 
numbers of occurrences of all the 4 k tuples of length k 
in all reads of a metagenome sample form the /c-tuple 
count vector of the sample. 

In this paper, we studied a collection of dissimilarity 
measures between /c-tuple count vectors, including three 
d 2 'type dissimilarity measures d 2 , d 2 , and d 2 [34]. For the 
calculation of d 2 and d 2l we need to centralize the real 
count of a tuple by deducting its expected number of 
occurrences under certain models for the underlying 
background genome sequences. We estimated the back- 
ground genome sequences using Markov models of orders 
0, 1, 2, and 3, respectively. In addition to the d 2 -type dis- 
similarities, we also studied the Euclidean, Manhattan, 
and Chebyshev distance measures for the /c-tuple fre- 
quency vectors. We also studied a dissimilarity measure 
developed from Dr. Hao s group (Hao) [41] and this meas- 
ure uses a Markov model of order /c-2 for the background 
sequences. We next describe these measures in detail. 

Let c x = (cx,i,cx,2, • • • , Cx,4*) and c y = ( c Y,i,c Ya , • • • , 
c Y ^) be the k- tuple count vectors of two metagenomic 
samples X and Y, respectively. The d 2 dissimilarity meas- 
ure [34] derived from the well-known D 2 statistic [48] is 



and 



D 2 (C X ,C Y ) = Z^^CXjCYj 

( \ 

D 2 (c x ,c Y ) 



(1) 



Dissimilarity measures d 2 and d 2 [34] derived from 
two variants of D 2 termed D 2 and D 2 [49], were also 
studied. To define d 2 and d 2) we first introduce the cen- 
tralized count variables by 

cx,i = c x ,i - n x px,i and c Y j = c Y j - n Y p Y j 
where p. fi is the probability of i th /c-tuple under the 
probability model (Markov model of order r=0, 1,2,3) for 



the background sequences and n. 
count of /c-tuples. Then we have 



c. i is the total 



D%(c x ,c Y ) 



i=l O 



CXjCyi 



I 



d*(c Xl c Y ) = - 




The ranges of both d 2 and d 2 are between 0 and 1. 
When the two samples are the same, both d 2 and d 2 are 
0. When the two samples are completely independent, 
the expected values of D 2 and D 2 are 0, and thus, the 
expected value of both d 2 and d 2 is 0.5. When the 
enriched tuples in the two samples complement each 
other, the value of both d 2 and d 2 is close to 0. There- 
fore, these measures can be used to measure the dissimi- 
larity between the samples. 

In addition to the newly developed d 2 -type dissimilarity 
measures, we also studied the standard Z^-norm distance 
measures for /c-tuple frequency vectors. We first normal- 
ized /c-tuple count vectors to /c-tuple frequency vectors 
whose components sum to l,fx = ^ and f Y = ^ 

where nx = cx,i and n Y = y^_ n ♦ Three clas- 

sical /p-norm distances including Manhattan (/j-norm), 
Euclidean (/ 2 -norm) and Chebyshev (/oo-norm), abbre- 
viated as Ma, Eu and C7z, respectively, are defined to 
compare /c-tuple frequency vectors. 



£ P (fx,tY)=(J2l 1 \fxi-fy 



i/p 



Ma(f x ,f Y ) = e^fxJr) = Ytifo ~ fY ''\ 



(4) 



Eu(f x ,f Y ) = e 2 (f x ,f Y ) 

= (z;>wd 2 ) 



1/2 



C/z(f x ,fr)=^oo(fx,fr) 



max \fxj 
i<;<4* 



(5) 



(6) 



d s 2 {cx,c Y )=\ 



\ \ \I C X.,+ C Y,,\ ' 1 \]<? X j + 4, J 



y -p 



4, 



(2) 



A dissimilarity measure developed by Qi et al. [41] from 
Dr. Hao s group is also studied. We use the corresponding 
author s last name Hao as the short form of this measure 
to simplify the notation. Hao is based on the relative differ- 
ence between the actual /c-tuple frequencies of each word 
with its expectation under a Markov model of order /c-2. 
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Hao - 



!m i i / jxa 1 

E[f X;i | M k _ 2 ] 1 J [ E[f Y>i | M k _ 2 } 



E 



fx,i 

E[f x>i | M k _ 2 ] 



i WE 



E[/r,i I M<_ 2 ] 



(7) 



A new dissimilarity measure developed by Willner 
et al [38] was also studied in this paper. The relative 
abundance odds ratio measure of dinucleotide was 
defined as p XY = J^- . The relative abundance odds 

ratio for tri-nucleotides was defined as y t Y7 = ^ YZ /^ Z , 

I AIZ, JXYJYZjXNZ 

and for tetra-nucleotides was defined as t xyzw = 

fxYZwfxYfxNzfxNMwfYzfYNwfzW ^ M represent any 

JXYZjXYNWjYZWjXjYJZjW 1 

nucleotide [32]. 

The new dissimilarity measures, termed Willner, based 
on the relative abundance odds ratios for di-, tri-, and 
tetra-nucleotides were defined as 



-T 



\pxrif) ~ Pxr(g)\ 



(8) 



(9) 



8*(f>g) = ^4 J2xYZw\ TXYZw( f"> ~ T *YZw(g)\ (10) 



For d\ and d* 2 and the Hao dissimilarity measures 
described above, the expected number of occurrences of 
each tuple needs to be calculated. To do this, we need to 
assume a model for the background sequence. In this 
study, we used Markov models of different orders (0, 1, 
2, and 3 for a% and d 2 , and k-2 for Hao) to model the 
background genome sequences. Markov model of order 
0 corresponds to the independent identically distributed 
(i.i.d.) model where genome sequences are generated as 
a series of independent and identically distributed ran- 
dom characters. Thus, for a /c-tuple w = WiW 2 • • • the 
expected frequency under the Markov model of order 0 
(M 0 ) is also the probability of w occurring under M 0 : 

E\f(wiW 2 ...w k )\M 0 ] =p w 

k 

= p(wiw 2 . . . w k ) = Y\p( w j) 

;=i 

where p(wj) is the probability of Wj in all the reads of a 
metagenomic sample. 

Under Markov model (M r ) of order 1< r < k-2 the 
expected frequency is 



E\f(wiW 2 ...w k )\M r ] 

k-r 

= p(wiW 2 . . . W r ) \\p(wj+ r \wjWj+i . . . Wj+ r -i) 
7=2 

where p(wjW 2 - • -W r ) is the initial distribution of con- 
secutive states WjW 2 . • -W r in all the reads of a metage- 
nomic sample, and p(w ; - + r \ WjWj + 1 ...Wj + r ^ 1 ) is the 
transition probability going from consecutive states 
WjWj + ! . . . Wj + r _ i to state Wj + r . 

Beta-diversity analysis and evaluation methods 

Detection of group relationships among microbial sam- 
ples and the identification of external gradients driving 
shifts in microbial community structure are two major 
types of analysis tasks in microbial community compari- 
son. Therefore, we evaluated the performance of the k- 
tuple dissimilarity measures in metagenomic comparison 
by assessing how well a method detects the known 
group relationships or identifies known environmental 
gradient. 

For clustering analysis, we detect group relationships 
among the microbial samples by applying the UPGMA 
(unweighted pair-group method with arithmetic means) 
algorithm [43] based on the between-samples dissimilar- 
ity matrices. It is a hierarchical clustering algorithm 
based on pair-wise dissimilarity matrix of multiple sam- 
ples, using average-linkage for measuring the dissimilar- 
ity of two clusters. We used the phangorn [50] package 
in R for this algorithm. We used the TreeClimber pack- 
age [25], a tool to compare clustering of microbial com- 
munities, to evaluate the resulting clustering trees by the 
parsimony test. The parsimony score measures how far 
away the clustering tree is from the true classification of 
the samples. If samples in the same group are clustered 
together in a clustering tree, the parsimony score is c-1 
where c is the number of groups. As the parsimony 
score increases, the clustering tree becomes increasingly 
different from the true grouping of the samples. Monte 
Carlo p-vdlue is used to see whether the observed parsi- 
mony score is statistically significant. Here, samples were 
randomly grouped to generate clustering trees 1,000 
times, and the p- value was approximated by the fraction 
of times that the parsimony score for the randomized 
trees is smaller than or equal to the parsimony score of 
the observed tree. 

For the study of gradient relationships among the 
samples, the shift of microbial samples is visualized by 
PCoA (Principal Coordinates Analysis) [44], which is a 
multi-dimensional scaling (MDS) method that converts 
between-sample dissimilarity matrix into two-dimensional 
(or three-dimensional) ordinates of samples and arranges 
the samples in ordinate space. We used the MASS 
[51] package in R for PCoA. Then, the influence of 
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environmental gradient(s) on microbial communities can 
be tested by calculating correlation, such as Pearson cor- 
relation coefficient (PCC), between PCoA axis and gradi- 
ent axis. In this way, the performance of the between- 
sample dissimilarity measures given by sequence signature 
methods can be evaluated if the gradient driving microbial 
communities is known. 
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